ABSTRACT

One of several papers commissioned by the National
Institute of Education (NIE), this discussion addresses the utility
of test-equating studies for large-scale program evaluation and
investigates the use of standardized tests in similar evaluation
efforts. The discussion begins with an examination of the question,
Should standardized achievement tests be used in a programmatic
evaluation of Follow Through? Predominant issues in the argument
against the use of standardized tests in program evaluation are then
addressed. It is concluded that the earlier practice of evaluating
Follow Through with a single standardized achievement test should be
abandoned. The second question addressed poses the issue of whether
NIE should sponsor another large-scale test-equating study in support
of Follow Through program evaluation. Consideration of this issue is
based largely on experience gained through the administration of the
Anchor Test Study (ATS). Following a brief description of ATS
history, the discussion draws on an extensive review of the
literature to suggest that ATS results have not been considered.
Finally, implications of the review for Project Follow Through are
presented together with (1) a set of recommendations concerning the
use of standardized tests in the evaluation of Follow Through, and
(2) recommendations about NIE sponsorship of a major test-equating
study involving tests that might be suitable for Follow Through
program evaluation. (RH)
INTRODUCTION AND OVERVIEW

This paper is one of those commissioned by the National Institute of Education in
support of planning for the testing of new approaches in the Follow Through program. It
is one of a number of papers that Follow Through planners categorized as "supporting
research," intended to consider fundamental methodological and analytic issues that will
likely impinge on the planning, design, operation and evaluation of the Follow Through
program in the next few years. More specifically, I was asked to address two issues: the
utility of test equating for large-scale program evaluation, and the use of standardized
tests in large-scale program evaluation.

In responding to the NIE charge, I have chosen to focus my attention more closely on
the Follow Through program than the NIE planning staff may have intended. But the issues
treated in this paper are clearly pertinent to large-scale evaluations of many
instructional programs, especially those intended for educationally disadvantaged
students.

The paper begins with an examination of the question: Should standardized
achievement tests be used in a programmatic evaluation of Follow Through? There is
abundant evidence of divergent views on this issue in the methodological literature.
Advocates can be found among both supporters and critics of the longitudinal evaluation
of Follow Through that culminated in the Abt Associates reports. Those who advise against
the use of standardized achievement tests in the evaluation of any program for
disadvantaged students (or, perhaps, in any program evaluation) include such pillars of
the measurement community as Ralph Tyler (1972; 1978) and the late Oscar Krisen Buros
(1977). Three issues appear to predominate in the argument against the use of
standardized tests in program evaluation: (1) the design of the tests makes them
insensitive to instruction and more suitable to their original purpose of distinguishing
normatively among the achievement levels of individual students; (2) no single
standardized test can be used to validly examine students' achievement of the
instructional objectives of the variety of projects, sponsors, and approaches present in
virtually all federally-supported education programs, and this lack of content congruence
makes standardized tests inappropriate as instruments for program evaluation; (3) the
political baggage attached to standardized tests results in wholesale overinterpretation
of the findings reported for them. Standardized tests have been oversold to the public and
to educational policy makers to the degree that they overshadow any alternative measures
of educational impact, regardless of their appropriateness or validity. The only prudent
course therefore is to avoid the use of standardized tests altogether in large-scale
program evaluations. Judgments on these issues are documented and discussed briefly.

Although the first and the third arguments against the use of standardized
achievement tests in program evaluation can only be supported by avoiding such use
altogether, the second argument, concerning content validity, might be handled
responsively by using a variety of standardized achievement tests instead of only one.
Presumably, if each sponsor or developer of a project were allowed to select the
standardized achievement test that most closely matched the content of his or her
instructional model, measurement validity would be materially improved. However, scores
on a variety of achievement tests, whether in raw-score or derived-score form, can
appropriately be aggregated for evaluation of program effects only if the tests have been
properly equated.

This leads to the second major question addressed in this paper: Should NIE sponsor
another large-scale test-equating study in support of Follow Through program evaluation?
Consideration of this issue is based largely on the experience gained through the first
large-scale test-equating study supported by the federal government: the Anchor Test
Study, sponsored by the U.S. Office of Education. The history of the Anchor Test Study
is reviewed briefly, with particular attention to its intended use as a tool in federal
evaluation of the effectiveness of the Title I, ESEA program. Utilization of the results
of the Anchor Test Study is reviewed next. A thorough search of relevant literature was
conducted to determine whether, and to what degree, Anchor Test Study results and data
have been used in program evaluation or in measurement and evaluation research. Finally,
the implications of this review for the Follow Through program are presented, together
with a set of recommendations on the use of standardized tests in the evaluation of
Follow Through and on NIE sponsorship of a major test-equating study involving tests that
might be suitable for Follow Through program evaluation.

SHOULD STANDARDIZED ACHIEVEMENT TESTS BE USED
IN AN EVALUATION OF THE FOLLOW THROUGH PROGRAM?

Arguments against the use of standardized achievement tests in program evaluation
are not new. In 1972, when commenting on the suitability of standardized achievement
tests for use as assessment devices, Ralph Tyler stated:

"...the exercises have not been obtained by a systematic sampling of
what children are expected to learn. Instead, the exercises comprise a
sample of items that differentiate children. It may seem odd . . .
but this is due to the fact that most psychometrists since World War I
have been interested in individual differences among children and in the
process of sorting children rather than in the process of learning."

Tyler expanded on his position in a reaction to a paper by Hoepfner, presented at a
USOE-sponsored conference on achievement testing of disadvantaged and minority students
for educational program evaluation in 1976 (see Tyler, in Wargo and Green, 1978). Tyler's
remarks are central to the question at hand, so it is worthwhile quoting him at length:

"the confinement of the selection [of achievement tests for program
evaluation] to contemporary norm-referenced achievement tests stultifies most
of the possibilities for valid and accurate appraisal of the outcomes of
educational programs. We can learn very little about the strengths and
weaknesses of programs of 'compensatory education,' or those designed for
children of minority groups, from the results of these tests. At best, they are
rough and imprecise measures, and most probably they are invalid."

"The first critical assumption made by psychological examiners is that the
purpose of the test is to measure individual differences and to arrange those
who take the test in a continuum from the best to the poorest. The purpose of
program evaluation is to determine how many pupils have learned what the
program seeks to teach, and the amount learned, rather than to separate pupils
as much as possible. Furthermore, psychological examiners assume that the
population of test takers should form a normal distribution similar to the
distributions of some of their physical characteristics like height and weight.
In contrast, educational programs are designed to help all pupils learn such
things as to read, compute, write and understand scientific principles."

Tyler goes on to suggest that the central purposes of the developers of standardized
achievement tests lead them to concentrate on items that are near the 50 percent
difficulty level, eliminating items that are suitable to the assessment of disadvantaged
students and students who are most able. He further suggests that differences between
widely used curricula force test makers to sample behaviors that are largely learned out
of school, rather than in school, in their attempt to build tests that are universally
applicable. Severe problems of invalidity are said to result.

Oscar Krisen Buros, in a reminiscence on fifty years of measurement history, also
commented critically on the use of standardized achievement tests in program evaluation.
His views mirror those of Tyler:

"[standardized achievement tests are] harmful to the development of
the best possible measuring instruments. . . . It seems inescapable that
such methods . . . insidiously tend to strengthen the status quo, to
impede curricular progress, to perpetuate our present grade classification, to
differentiate rather than to measure, conceal unlearning, and to give an
illusory sense of continuous learning from grade to grade" (1977).

Speaking in reaction to a paper by William Coffman at the conference on achievement
testing of disadvantaged and minority students referenced above, Jaeger (1973) also
advised against the use of standardized achievement tests in evaluating education
programs for disadvantaged students:

"The content and skills to be measured by commercially available
standardized tests are determined through expert judgments of what typical
students in specific grades should know, when exposed to widely used basic
skills curriculum materials available for these grades. Are minority and
disadvantaged students to be considered typical students? Clearly not, for if
they were we would not have specially designed programs to meet their special
needs. Are the curriculum materials used in these programs typical of those
used with students in these grades? Logic would again tell us that the answer
is no. For if standard curriculum materials were used, there would be no need
for special programs. So the very process by which standardized achievement
tests are planned, if the process follows the ideal, threatens the content
validity of these tests for uses that are the subject of this conference. If
the content coverage of the tests corresponds to typical curricula, and not to
curricula used in the programs to be evaluated, the evaluator may well be
judging the status and progress of disadvantaged children on material they have
not had the opportunity to learn."

These various judgments on the appropriateness of standardized achievement tests for
educational program evaluation are supported by recent research on the effects of the
congruence between test content and curriculum or instructional content on student
achievement. This literature is well reviewed by Tittle (1980), but several relevant
findings are noted here. Jenkins and Pany (1976) conducted a careful analysis of the
overlap in vocabulary between five standardized reading achievement tests recommended by
their publishers for use in grades one or two, and seven basal reading texts recommended
by their publishers for use in the same grades. The authors estimated the grade
equivalent scores that would be earned by students who mastered all of the words in a
given reader, and then answered correctly all of the items on a given test that
pertained to words common to the reader and the test. The most extreme variation in
scores for a given grade-one reading book was a low grade-equivalent score of 1.0 to a
high grade-equivalent score of 2.3. The range of extremes for second-grade books was a
low grade-equivalent score below 1.0 if one test was used to a high grade-equivalent score
of 3.4 if another test was used. For the Metropolitan Achievement Word Knowledge Test,
grade-equivalent scores ranged from a low of 1.9 for one reading book to a high of 2.5 for
another, at the second-grade level. The range for other tests was typically much greater.
In short, the Jenkins and Pany study shows that the content of the achievement test used
to evaluate a program is critical to the resulting estimate of its effectiveness. In the
evaluation of a program like Follow Through, where a large variety of curricula are
purposefully included, some of those curricula must necessarily suffer selection bias if
a single test is used for evaluation of student achievement. These conclusions are also
consistent with the results of investigations by Armbruster, Stevens & Rosenshine
(1977), Chang and Raths (1971), Schutes (1969), Cooley and Leinhardt (1978), Hoepfner
(1973) and Bianchini (1973). The latter two studies are discussed in the next section of
this paper.

Although federally-supported education programs typically have broad lists of
objectives that include alleviation of social, economic and educational deprivation,
judgments of the success or failure of these programs have often been grounded in
students' performances on standardized achievement tests. The Follow Through program is a
case in point. Schiller, Stalford, Rudner, Kocher and Lesnick (1980) describe Follow
Through as a "Federal educational assistance program designed to provide comprehensive
services to children from low income families and to increase understanding about
effective practices in educating these children" (p. 2). In elaborating on the meaning of
"comprehensive services," they include health, social, and other support services, in
addition to educational services.

Schiller, et al. summarize the principal findings of a nine-year evaluation of the
national Follow Through program as follows:

"There was more variability in outcomes within models from site to
site than there was between models;

Models that emphasized basic skills produced more gains in those areas and
in self concept than other models;

Overall, there was little difference observed in the performance of Follow
Through and non-Follow Through children. Both groups of youngsters remained
substantially below national norms" (p. 3).

Each of these findings concerns students' performances on pencil-and-paper assessment
instruments, and most refer to their performances on the Metropolitan Achievement Tests.
These judgments of the merits of the Follow Through program depend on a narrow range of
outcome measures, with standardized achievement tests heading the list of those measures.
It should be noted that Schiller, et al. accurately reflect the emphases given to various
Follow Through outcome measures in the Abt Associates reports on the longitudinal
evaluation (Cline, Ames, Anderson, Bales, Ferb, Joshi, Kane, Larson, Park, Proper,
Stebbins & Stern, 1974) to (Stebbins, St. Pierre, Proper, Anderson & Cerva, 1977), as well
as in subsequent discussions of those reports in the educational research literature
(House, Glass, McLean & Walker, 1978; Anderson, St. Pierre, Proper & Stebbins, 1978;
Wisler, Burns & Iwamoto, 1978).

It is interesting to note that all of the papers concerned with the accuracy of
the longitudinal evaluation of the Follow Through program that were presented in the May,
1978 issue of the Harvard Educational Review endorse the use of the Metropolitan
Achievement Test as an outcome measure. House, et al. protested the labeling of that test
as a "basic skills" measure, suggesting that it assessed a far narrower range of skills,
which they labeled the "mechanics of reading and arithmetic." However, they did not
protest the use of a standardized achievement test in the national Follow Through
evaluation, except on the grounds of insufficiency:

"The coverage of outcome domains is so poor that no judgment of best model
can legitimately be made, no matter how large the difference in test scores"
(p. 156).

And also on page 156:

"Even if dependable differences were found on the MAT, such differences
would be inadequate evidence of which model is best. Follow Through was to be
an investigation of models of comprehensive early childhood education -- not
just reading, not just arithmetic, not just language usage. An attempt was made
to measure more than a few narrow scholastic outcomes but that attempt was not
successful. It serves no one well to proceed as if it had been. Although who
did best on the MAT might be a valid question, it would be wrong to confuse
that question with the one that was actually asked."

In their response to the House, et al. criticisms, the principal Abt Associates
evaluators of the Follow Through program (Anderson, et al., 1978) cite the main
conclusions of their report, which they claim their critics ignored. One citation is
particularly telling, in that it illustrates the primacy of the Metropolitan Achievement
Tests in the Follow Through evaluation, and the way in which the language of evaluative
reporting can go well beyond the limited scope of the data collected. This citation fuels
the argument of those who suggest that standardized achievement tests not be used in
large-scale program evaluations because their results will be overinterpreted. Although
they are basing their conclusions solely on MAT scores, Anderson, et al. state:

"With few exceptions, Follow Through groups were still scoring
substantially below grade level at the end of three or four years' intervention
(Bock, Stebbins & Proper, 1977, passim). Poor children still tend to perform
poorly in school even after the best and the brightest theorists -- with the
help of parents, local educators, and federal funds, and supported by the full
range of supplementary services associated with community action programs --
have done their best to change the situation."

In their reaction to the House, et al. criticism of the Follow Through evaluation,
the USOE program officers who supervised the Abt Associates work (Wisler, et al., 1978)
make an interesting claim to the validity of the MAT as a measure of Follow Through
effectiveness: "We agree that many model-specific objectives were not measured in the
national evaluation, but the goal was to gather valid data on a common set of outcomes
generally considered important."

The validity arguments made by Tyler and Buros were apparently ignored by the
contributors to the Harvard Educational Review papers. Although the latter authors debate
the adequacy of the MAT as a measure of Follow Through effectiveness, none of them seems
to consider the possibility that it was selectively biased against some or most of the
Follow Through models. Both sets of defenders of the evaluation base their case on the
importance of the content measured by the MAT, and ignore the possibility that that
content might not have been a part of the curricula of the Follow Through projects that
were evaluated.

Wisler, et al.'s validity claim for the MAT embodies a misuse of the term. The data
collected in the Follow Through evaluation are neither valid nor invalid. It is the
conclusions and inferences put forth on the basis of those data that must be examined for
validity. What appears to be invalid is the conclusion that the Follow Through program
failed just because the MAT scores were low.

In the final analysis, the use of standardized achievement tests in large-scale
program evaluation, and in particular, in the Follow Through evaluation, is a matter of
judgment. The position of those who oppose such uses of standardized tests is
increasingly supported by evidence on the differential content validity of widely-used
tests when applied to early primary level programs in the basic skills. On the basis of
that evidence, I would recommend that the earlier practice of selecting a single
standardized achievement test for overall evaluation of the Follow Through program be
abandoned.

Whether a number of standardized achievement tests should be used in a national
Follow Through evaluation, after they have been properly equated, is a separate question
that is examined in the balance of this paper.

SHOULD FOLLOW THROUGH SPONSOR
ANOTHER MAJOR TEST-EQUATING STUDY?

If data collected using several standardized achievement tests were to be aggregated
in a national evaluation of the Follow Through program, it would first be necessary to
equate the tests used in the evaluation. Such an undertaking would involve a major
investment of funds and time, and might not be feasible economically or technically. The
merits of a test-equating study as a tool for Follow Through evaluation can perhaps best
be explored by considering in detail the objectives, results, and utilization of the
first large-scale test-equating study supported by the federal government: the
USOE-sponsored Anchor Test Study. The next section of this paper contains a brief review
of the history of the Anchor Test Study, and a detailed review of the utilization of
Anchor Test Study results in program evaluation and in measurement and evaluation
research. These reviews are used as the basis of recommendations on the advisability of
NIE sponsorship of a test-equating project in support of Follow Through evaluation.
THE ANCHOR TEST STUDY:
HISTORY, UTILIZATION, AND IMPLICATIONS FOR FOLLOW THROUGH

A Brief History 

In the late 1960's, the U.S. Office of Education (USOE) conducted several large-scale
surveys for the purpose of securing information that would be useful in judging the
operation and impact of Title I of the Elementary and Secondary Education Act of
1965 (ESEA). Students' performance on standardized achievement tests, particularly in
reading and mathematics, was adopted as a primary indicator of the direction of program
services to students who were most educationally disadvantaged. In addition, changes in
performance on standardized achievement tests, from the beginning of a school year to the
end, were to be used as a primary indicator of programmatic success or failure in
alleviating the effects of economic deprivation and educational disadvantagement.

At the time these surveys were initiated, state departments of public instruction
and many large school systems resisted the collection of any uniform achievement test
data by the U.S. Office of Education. The adverse political impact of the Survey on
Equality of Educational Opportunity (Coleman, et al., 1966) was still being felt, and
most Chief State School Officers were wary of any data that would permit the comparison of
students' achievement test performances in different states. As a result of this
apprehension, it was decided that use of a common achievement test in Title I evaluation
surveys was politically infeasible.

The first USOE Survey on Compensatory Education (1968) requested that school systems
provide achievement test scores in reading and mathematics, from their existing records,
for students in grades two, four, and six, at the beginning and end of the 1968 school
year. The reported scores for tens of thousands of students included performances on more
than 400 combinations of tests, levels, and forms, administered in the fall and spring of
1968. However, seven major achievement tests accounted for about 90 percent of the scores
reported in that year.

The achievement test scores reported by school systems were essentially useless for
purposes of Title I evaluation. Once the data were obtained, it was quickly realized that
no single publisher's tests were used with a sample of students that was remotely
representative of the population being served by Title I, ESEA. The temptation to convert
scores on different tests to a common derived scale (such as grade equivalent scores,
T-scores, or percentile ranks) was ignored when analysis of the content of the tests and
the publishers' norms revealed substantial differences along both dimensions. One might
argue that the Metropolitan Achievement Tests and the Iowa Tests of Basic Skills, say,
assess reading comprehension using somewhat similar exercises, and that scores on both
tests would be highly correlated were both to be administered to a large randomly-selected
sample of fourth-graders. However, the sampling methods used by the publishers
of these tests in developing their national norms were appreciably different, as were the
cooperation rates of different types of school systems invited to participate in their
test normings. As a result, the "national" norms reported by these publishers could not
be considered equivalent. And derived scores on these two tests, or on any others, could
not legitimately be aggregated for purposes of Title I evaluation.

In 1969, the Bureau of Elementary and Secondary Education in USOE supported a study
of the feasibility of equating scores on the reading comprehension and vocabulary
subtests of five major test batteries. Expert judges developed a content classification
system for the tests and assigned each item on every test to a content category, in an
attempt to estimate the congruence of the tests. In addition, triples of tests were
administered to several thousand students in the District of Columbia and the states
surrounding Washington, D.C., so that correlations among corresponding subtests could be
estimated. The judges' attempts to estimate the content similarities of corresponding
subtests were not successful. Their lack of agreement on the categorization of items from
a given subtest was so great that their estimates of the congruence of different subtests
were suspect. However, the correlations among corresponding subtests in different test
batteries were sufficiently high (at least in the high 80's when disattenuated) that a
major test-equating study was judged to be feasible.

Early in 1971, Educational Testing Service was awarded a $700,000 contract to
conduct an equating and restandardization study involving the seven most widely used
reading comprehension and vocabulary subtests in grades four, five, and six. The project
came to be known as the "Anchor Test Study" because of the equating methodology employed.
It involved restandardization of the reading comprehension and vocabulary subtests of the
Metropolitan Achievement Tests intended for use with students in grades four, five and
six, using a carefully selected stratified sample of public and non-public elementary
schools chosen from counties throughout the United States. More than 500,600 students in
these grades provided useable data for restandardization of these subtests. A second part
of the Anchor Test Study was an equating study that produced tables of score
correspondence between the reading comprehension and vocabulary subtests of the
California Achievement Tests, the Comprehensive Tests of Basic Skills, the Iowa Tests of
Basic Skills, the Metropolitan Achievement Tests, the Sequential Tests of Educational
Progress, the SRA Achievement Series, the Stanford Reading Tests, and in a supplementary
study, the Gates-MacGinitie Reading Series. Nearly 155,000 students provided useable test
scores for the equating portion of the study conducted in the Spring of 1972, and an
additional 14,000 students provided useable test scores the following spring for the
supplementary equating of the Gates-MacGinitie test to the other seven tests.
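
To make the idea of a "table of score correspondence" concrete, the following is a minimal sketch of a bare-bones equipercentile rule: a score on one test is matched to the score on another test that holds the same percentile rank in a common group of students. It is an illustration only. The function names and the toy score lists are hypothetical, and the procedure actually used in the Anchor Test Study is documented in the project reports and is not reproduced here.

```python
from bisect import bisect_right


def percentile_rank(ordered_scores, x):
    """Percent of scores in the sorted list that are at or below x."""
    return 100.0 * bisect_right(ordered_scores, x) / len(ordered_scores)


def equipercentile_table(scores_a, scores_b):
    """Match each observed Test A score to the lowest Test B score whose
    percentile rank is at least as high (no smoothing or interpolation)."""
    ordered_a = sorted(scores_a)
    ordered_b = sorted(scores_b)
    table = {}
    for x in sorted(set(ordered_a)):
        target = percentile_rank(ordered_a, x)
        for y in ordered_b:
            if percentile_rank(ordered_b, y) >= target:
                table[x] = y
                break
    return table


# Toy raw scores for the same hypothetical group on two tests (not ATS data).
test_a = [10, 12, 14, 16, 18]
test_b = [20, 25, 30, 35, 40]
correspondence = equipercentile_table(test_a, test_b)
```

An operational study would smooth the score distributions and interpolate between score points; the sketch omits both and is meant only to show what a correspondence table contains.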

Additional details on the design of the Anchor Test Study and its results can be
found in the thirty-volume final report on the project (Loret, Seder, Bianchini and Vale,
1972), in the three-volume supplementary report (Loret, Seder, Bianchini and Vale, 1973),
and in review articles by Linn (1973) and by Jaeger (1973).

Although the methodological soundness of the Anchor Test Study is uncontestable, one
could probably find a variety of views on its ultimate value. Its direct federal cost of
three-quarters of a million dollars pales in comparison to the fifty million dollar
federal expenditure for evaluation of the Follow Through program. However, its objectives
were far narrower, and its findings created far less excitement and controversy in public
and policy circles.

The Anchor Test Study clearly established the feasibility of equating the reading
comprehension and vocabulary subtests of different test batteries, even though they were
not designed to be psychometrically parallel. Prior to completion of the study, it was
not certain that the tests to be equated were similar enough to make equating possible. In
theory, parallel tests can be equated and non-parallel tests cannot (Angoff, 1971). Tests
that differ in difficulty, length, reliability, and the constructs they assess will not
generally exhibit a consistent relationship across samples of different composition. To
be equatable, it is often suggested that a pair of tests have a disattenuated
intercorrelation of at least 0.95, a value that is similar to the inter-form correlations
of many achievement tests used in the elementary grades.
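
The disattenuated correlation referred to here is conventionally obtained by dividing the observed correlation between two tests by the square root of the product of their reliabilities. The following minimal sketch applies that standard correction; the function name and the numerical values are hypothetical and are not taken from the Anchor Test Study reports.

```python
import math


def disattenuated_correlation(r_xy, r_xx, r_yy):
    """Correct an observed correlation r_xy for unreliability in both tests,
    given the reliability coefficients r_xx and r_yy."""
    if not (0 < r_xx <= 1 and 0 < r_yy <= 1):
        raise ValueError("reliabilities must lie in the interval (0, 1]")
    return r_xy / math.sqrt(r_xx * r_yy)


# Hypothetical values chosen only for illustration.
observed = 0.88
reliability_a = 0.90
reliability_b = 0.92
corrected = disattenuated_correlation(observed, reliability_a, reliability_b)

# Compare against the 0.95 rule of thumb mentioned in the text.
meets_criterion = corrected >= 0.95
```

Because reliabilities are at most 1, the corrected value is never smaller in magnitude than the observed one, which is why a pair of tests can fall short of 0.95 on observed scores and still satisfy the criterion after correction.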

Correlations among the subtests equated in the Anchor Test Study were typically in
excess of the 0.95 criterion. Of 189 correlations between pairs of subtests equated at
levels appropriate to students in grades four, five, and six, 106 (56 percent) were in
the range 0.98 to 1.00; 53 (28 percent) were in the range 0.95 to 0.97; 26 (14 percent)
were in the range 0.92 to 0.94; and only 4 (2 percent) were in the range 0.89 to 0.91.
Thus 84 percent of the subtest correlations met the admittedly arbitrary criterion of
0.95. In addition, the standard errors of equating achieved through the Anchor Test Study
were consistently less than one-half of a raw-score point at all score levels above the
chance scores on the subtests being equated (Loret, Seder, Bianchini and Vale, 1972).
Equating errors of this magnitude are generally smaller than test publishers have
realized when they equated alternate forms of their tests that were designed to be
parallel.
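
The percentages in the preceding paragraph can be checked by simple arithmetic. The sketch below assumes that the four reported counts (106, 53, 26, and 4) cover all of the tabulated correlations; the variable names are illustrative.

```python
# Reported counts of subtest correlations, keyed by correlation band.
counts = {
    "0.98-1.00": 106,
    "0.95-0.97": 53,
    "0.92-0.94": 26,
    "0.89-0.91": 4,
}

total = sum(counts.values())

# Share of all correlations in each band, rounded to whole percents.
percents = {band: round(100 * n / total) for band, n in counts.items()}

# Share of correlations at or above the 0.95 criterion (top two bands).
meeting = counts["0.98-1.00"] + counts["0.95-0.97"]
share_meeting = round(100 * meeting / total)
```

Under that assumption the counts sum to 189, and the two highest bands together account for 159 of the correlations.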

Utilization of Anchor Test Study Findings 

Use in Program Evaluation. Although the Anchor Test Study was intended primarily to
provide a tool for use in evaluating Title I, ESEA at the national level (and perhaps at
state levels as well), there is little evidence that it has been used for this purpose.
In 1979, Stonehill and Fishbein presented a methodological paper on the aggregation of
achievement gains in Title I evaluations that referenced data on the comparability of
achievement test results produced through the Anchor Test Study. They concluded that the
normal curve equivalent scale developed as a part of the Title I Evaluation and Reporting
System did not reflect a common score metric, and thus did not produce equivalent
achievement scores for students administered different tests. Neglecting this judgment,
USOE fostered the adoption of regulations on local Title I evaluation that incorporate
the models recommended in the Title I Evaluation and Reporting System.

The conclusions advanced by Stonehill and Fishbein are based in part on a paper by 
Jaeger (197?) that examined the consistency of achievement gains in the normal curve 
equivalent metric that would be realized using various reading achievement tests equated 
in the Anchor Test Study. Jaeger concluded that the aggregation of achievement test 
results in the NCE metric would incorporate measurement errors that were likely to 
exceed the true gains typically found in Title I evaluations. 

Vale and Bianchini (1973) used Anchor Test Study data in completing an analysis of 
the policy implications of various distributions of federal funds to school systems. In 
particular, they provided a basis for establishing eligibility criteria for participation 
in the then-proposed Better Schools Act, using relationships between students' 
performances on Anchor Test Study tests, certain family background variables (such as 
parental income category) and certain school system variables (such as degree of 
urbanism). Although this application of Anchor Test Study findings is not strictly 
evaluative, it does fit within the broad framework of federal education program analysis. 

In 1974, Jaeger presented a paper on the use of Anchor Test Study results in federal 
and statewide evaluation of Title I at the Annual Meeting of the American Educational 
Research Association. The paper enumerated some methodological possibilities, but was 
based more on conjecture and fond hope than on experience. Its subsequent impact on 
federal Title I evaluation is not demonstrable. 

Apart from these four papers, a thorough search of the ERIC data base on such 
descriptors as evaluation methods, federal programs, equated scores, compensatory 
education programs, achievement tests, and achievement gains failed to produce any 
evidence that the Anchor Test Study has been used either by the federal government or by 
state agencies in the evaluation of any federally-supported education program. If the 
Anchor Test Study has had any significant impact on evaluation or measurement practice or 
theory, it is clearly apart from its direct use as a tool in program evaluation. 

Use in Educational Research and Assessment. Results from the Anchor Test Study have 
been used in a variety of educational research and assessment projects, ranging from 
methodological research on measurement and analysis of data to studies of the correlates 
of achievement. Goulet et al. (1975) used the Anchor Test Study data to examine the 
severity of problems encountered in measuring achievement change, the feasibility of 
vertical test equating, and the stability of measurement constructs over time, in an 
extensive study of methodological problems in longitudinal research, supported by the 
National Institute of Education. Because some test forms used in the Anchor Test Study 
were recommended by their publishers for use with students in more than one of grades 
four, five and six, it was possible to examine the consistency of vertically equating 
relationships between levels of other tests recommended for use in only one of these 
grades. 

Data collected in the Anchor Test Study included a number of descriptors of the 
schools and classrooms of students participating in the study, as well as descriptors of 
individual students. Principals provided information on the socio-economic composition 
of the attendance areas served by their schools and on the degree of urbanism of their 
schools' attendance areas. Teachers provided information on class size and the use of 
ability grouping, as well as descriptive information on individual students who 
participated in the study, such as IQ range, race, and primary language used in the 
student's home. Two studies (Burgdorf, 1976; Doucette and St. Pierre, 1977) made use of 
these data to examine some of the correlates of reading achievement in the upper primary 
grades. Burgdorf used Total Reading scores on the Metropolitan Achievement Test level 
administered to more than £3,000 fifth-graders to construct an extensive series of cross 
tabulations. He examined as many as three of the ten descriptive variables used in the 
Anchor Test Study in relation to distributions of Metropolitan Total Reading scores. In 
a similar study supported by the National Center for Educational Statistics, Doucette and 
St. Pierre (1977) found that school variables such as urbanism of school location, type 
of school support (public vs. private), socioeconomic composition of the student body, 
and percentage of minority enrollment were clearly related to reading achievement as 
measured by the Metropolitan Achievement Test. Likewise, individual student variables 
such as reported IQ range, race or ethnicity, primary language spoken in the student's 
home, and teacher's diagnosis of the existence of a reading problem were significant 
correlates of reading achievement. However, the two classroom variables studied, class 
size and presence or absence of ability grouping, were not found to be related to reading 
performance on the Metropolitan Achievement Test. 

Rasp and Stiles (1974) and Rasp (1974) reported the results of a two-year experience 
in using the Anchor Test Study equating tables in conjunction with the Washington State 
Assessment Program. Instead of requiring the administration of a common reading 
achievement test throughout the state, the State Department of Education attempted to 
develop a profile of reading performance for its fiscal year 1973 ESEA Title III needs 
assessment plan by analyzing sixth-grade achievement data routinely collected by a 20 
percent sample of school systems. As might be expected, not all school systems used the 
test batteries, levels and forms equated in the Anchor Test Study, and problems of 
sampling bias were encountered. With National Institute of Education support, Rasp 
conducted a later study in which he developed computer programs that would apply the 
Anchor Test Study norm tables to data collected in statewide assessments. Experience 
gained in using Anchor Test Study results to aggregate statewide reading achievement data 
was applied to the development of generalizable guidelines for such applications. 

In a National Institute of Education-supported study completed in 1980, Linn et al. 
used Anchor Test Study data to investigate the possibility that content and format 
characteristics of reading comprehension items were consistently related to differences 
in item characteristic functions for students classified by race. This study of bias in 
reading comprehension items employed eight subgroups of students classified by grade 
level (fifth and sixth), income level of school attendance areas (low and other), and race 
(black and white). A normative basis for judging meaningful differences in item 
characteristic curves was established by observing functional differences for groups of 
the same race that differed in grade level and income level of school attendance area. 
Unfortunately, the items that were identified as racially biased were not consistently 
different from other items in either format or content. Nonetheless, the Anchor Test 
Study provided a large and appropriate database that enabled the authors to characterize 
item bias in a unique way and to examine substantive correlates of item bias. 

Use in Equating Research. A considerable amount of research on the methodology of 
test equating has been completed using data from the Anchor Test Study. Because the 
Anchor Test Study data tapes contain extremely detailed information (in addition to 
demographic information, the raw data tapes identify the option chosen by every student 
in response to every test item attempted), item characteristic curve equating models as 
well as methods that employ total test scores can be applied to the data set. 

Rentz and Bashaw (1973, 1977) conducted an extensive reanalysis of all data 
collected in the restandardization and equating portions of the Anchor Test Study, using 
the Rasch one-parameter item response model. They estimated the Rasch difficulty of each 
reading comprehension and vocabulary item in the seven tests used in the original Anchor 
Test Study, and established common reference scales for vocabulary subtests and for 
reading comprehension subtests. Once the tests were placed on a common reference scale, 
Rentz and Bashaw determined corresponding raw scores on the vocabulary subtests of 
different test batteries, and corresponding raw scores on the reading comprehension 
subtests of different test batteries. 
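
One way such corresponding raw scores can be obtained under the Rasch model is sketched below. This is an illustrative assumption about the general technique, not a reconstruction of the procedure Rentz and Bashaw actually followed. Given item difficulties for two tests already expressed on a common scale, a raw score on test X is converted to the ability at which the expected score on X equals that raw score, and the expected score on test Y at the same ability is then read off. All function names and difficulty values are hypothetical.

```python
import math


def expected_raw_score(theta, difficulties):
    """Rasch test characteristic curve: expected number correct at ability theta."""
    return sum(1.0 / (1.0 + math.exp(-(theta - b))) for b in difficulties)


def ability_for_raw_score(raw, difficulties, lo=-10.0, hi=10.0, tol=1e-8):
    """Find the ability whose expected raw score equals `raw`, by bisection.

    The test characteristic curve is strictly increasing in theta, so
    bisection converges. Zero and perfect scores have no finite ability
    estimate and are rejected.
    """
    if not 0 < raw < len(difficulties):
        raise ValueError("raw score must be strictly between 0 and the test length")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if expected_raw_score(mid, difficulties) < raw:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def corresponding_score(raw_x, diffs_x, diffs_y):
    """Raw score on test Y that corresponds to raw_x on test X (common scale assumed)."""
    theta = ability_for_raw_score(raw_x, diffs_x)
    return expected_raw_score(theta, diffs_y)


# Hypothetical item difficulties (in logits) for two short subtests.
test_x = [-1.0, -0.5, 0.0, 0.5, 1.0]
test_y = [-0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
for raw in range(1, len(test_x)):
    print(raw, round(corresponding_score(raw, test_x, test_y), 2))
```

The sketch takes the item difficulties as given; estimating them, and linking the two tests to a common scale, are separate steps that are not shown.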

The Rentz and Bashaw study has important methodological implications for subsequent 
equatings of standardized achievement tests. Because the Rasch model purportedly provides 
"sample free" item calibrations, the representativeness of examinee samples used in the 
development of equating functions should not be a critical concern, as it is when 
classical equating procedures are used. It is also possible that sample size requirements 
will be somewhat smaller, since a more explicit relational model is being used. 

Unfortunately, the results of the Rentz and Bashaw study were equivocal. For some of 
the subtests equated in the Anchor Test Study, the classical equating functions and the 
Rasch-determined equating functions were nearly identical over most of their score 
scales. For other subtest pairs, the differences between classical and Rasch equating 
functions were three or more raw-score points, an amount that is substantial when viewed 
as a component of bias error that will not diminish as a function of sample size. Even 
more perplexing is the question of which results are "correct" or "true." If one adopts 
the classical definition of equivalent scores (scores that correspond to the same 
mid-percentile rank in any sample of examinees), the classical equating function must be 
viewed as a standard. Conversely, if one defines as equivalent two scores that correspond 
to the same Rasch ability level, the Rasch results must be viewed as a standard. For the 
moment, the classical definition of equivalent scores is more widely accepted. 
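
The classical definition can be made concrete with a small sketch of equipercentile equating. It assumes that both tests were administered to the same (or randomly equivalent) groups, uses linear interpolation between adjacent score points, and omits the smoothing that operational equating would normally apply. The function names and the score samples are hypothetical.

```python
def mid_percentile_rank(score, sample):
    """Percent of the sample scoring below `score` plus half the percent scoring at it."""
    below = sum(1 for s in sample if s < score)
    at = sum(1 for s in sample if s == score)
    return 100.0 * (below + 0.5 * at) / len(sample)


def equipercentile_equivalent(score_x, sample_x, sample_y):
    """Score on test Y with the same mid-percentile rank as score_x on test X."""
    target = mid_percentile_rank(score_x, sample_x)
    y_scores = sorted(set(sample_y))
    y_ranks = [mid_percentile_rank(y, sample_y) for y in y_scores]
    # Outside the observed range of ranks, fall back to the extreme scores.
    if target <= y_ranks[0]:
        return float(y_scores[0])
    if target >= y_ranks[-1]:
        return float(y_scores[-1])
    # Interpolate between the two adjacent Y scores whose ranks bracket the target.
    for i in range(len(y_scores) - 1):
        r0, r1 = y_ranks[i], y_ranks[i + 1]
        if r0 <= target <= r1:
            y0, y1 = y_scores[i], y_scores[i + 1]
            return y0 + (y1 - y0) * (target - r0) / (r1 - r0)


# Hypothetical raw-score samples for two subtests.
sample_x = [12, 15, 15, 18, 20, 22, 22, 25, 27, 30]
sample_y = [10, 14, 16, 16, 19, 21, 24, 24, 26, 29]
print(equipercentile_equivalent(20, sample_x, sample_y))
```

A Rasch-based correspondence would instead match scores through a common ability scale, which is why the two definitions can yield different tables for the same pair of tests.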

Slinde and Linn (1977, 1978) used Anchor Test Study data to examine the feasibility 
of constructing consistent vertical equating tables for corresponding subtests in 
different levels of a test battery. They also examined the utility of the Rasch model in 
constructing vertical equating tables. They concluded that vertical equating of reading 
achievement tests was hazardous, regardless of the analytic method employed, and that use 
of different test levels in studies of achievement gain should be avoided if possible. 
Their findings also have implications for out-of-level testing, a practice that is common 
in evaluations of compensatory education programs. Aggregation of achievement test data 
across levels of tests can lead to the incorporation of sizeable measurement errors. 

With the support of the U. S. Office of Education, Bianchini and Vale (1975) 
examined the applicability of Anchor Test Study equating tables to groups composed of 
black or Spanish-surnamed students. They searched for evidence of interactions between 
equating relationships and the racial/ethnic composition of subgroups upon which they 
were based. Fortunately, no systematic relationships were found, and isolated evidence of 
an equating function by race/ethnic group interaction was attributed to relatively small 
black and Spanish-surnamed representation in the Anchor Test Study sample (leading to 
larger random equating errors), rather than to consistent racial/ethnic bias error. 

Beard and Pettie (197?) used the Rentz and Bashaw (1973) reanalysis of Anchor Test 
Study data as a benchmark in judging the degree of Rasch fit of items in the Florida 
Educational Assessment tests in communications and mathematics administered to 
third-graders and fifth-graders. Their primary research focus was a comparison of the results 
of classical linear equating and Rasch equating of test forms used in 1976 and 1977. They 
concluded that items in the Florida Assessment tests fit the Rasch model to a greater 
extent than did items in the standardized achievement tests used in the Anchor Test Study. 
It is interesting to note that they also found close correspondence between the results 
of linear equating and Rasch equating, suggesting (logically) that model fit may be 
critical when equating tests using the Rasch model. 

Implications of the Anchor Test Study for Follow Through Research 

From the review of literature reported above, it is clear that Anchor Test Study 
results have been used very little in large-scale program evaluation. Despite the 
supposition that the Anchor Test Study would facilitate collection of achievement test 
data that could be aggregated across projects, school systems, and states so as to 
provide bases for examining the targeting of services and the impact of services 
supported under Title I, ESEA, a thorough search of the ERIC system did not produce any 
supporting evidence. 

Perhaps it is not surprising that the Anchor Test Study has contributed so little to 
state and federal evaluation of Title I, ESEA in view of the virtual abandonment of the 
large-scale survey approach to Title I evaluation at state and federal levels. At the 
time the Anchor Test Study was conceived, the U. S. Office of Education collected uniform 
data on the structure, organization, and operation of hundreds of Title I projects, as 
well as uniform information on the background, characteristics, and participation of 
thousands of students. More than a few states emulated the federal approach to Title I 
evaluation. More recently, the Title I Evaluation and Reporting System has emphasized 
provision of data by local school systems using a common format, but allowing the use of 
any basic skills achievement measures that can be related to tests that have national 
norms. In effect, school systems have been encouraged to use criterion-referenced 
measures, and to equate these measures to nationally standardized tests through loosely 
controlled local equating studies. Much of the rhetoric of the measurement and evaluation 
community has served to relegate standardized tests to second-class status as instruments 
for use in program evaluation. Truly representative norms for the reading comprehension 
and vocabulary subtests of the Metropolitan Achievement Tests, and tables of score 
correspondence between these subtests of the Metropolitan and those of seven other test 
batteries understandably hold less currency than they once did in the minds of 
evaluators. 

As noted above, the research uses of Anchor Test Study data have far outpaced use of 
the study by evaluators. It is not unreasonable to conclude that the Anchor Test Study 
has led to a resurgence of interest in research on test equating. Certainly, the 
research literature on test equating has grown at a far faster rate since the Anchor Test 
Study was completed than in the eight year period prior to publication of its final 
report. And a good bit of the empirical research on test equating completed in the last 
eight years has made use of the Anchor Test Study data tapes. 

We have also noted the extensive use of Anchor Test Study equating tables and data 
tapes in secondary analyses and applications ranging from studies of the correlates of 
reading achievement to investigations of test item bias. The sheer size of the data base, 
in addition to its nationally representative structure, has permitted the creation of large 
subsamples of examinees, classified on such variables as sex, race, IQ level, and 
language usage. Such subsamples are essential to much empirical research on item bias and 
the correlates of achievement, thus supporting the conclusion that the Anchor Test Study 
greatly facilitated this research. 

In view of the history of usage of Anchor Test Study results and data, it is 
reasonable to ask whether a similar study would be of material value either in the 
evaluation of the Follow Through program or as a part of the research in support of 
future Follow Through approaches envisioned by Schiller et al. (1980). Each of these 
questions will be addressed separately. 

How could the results of a large-scale test equating study be used in Follow Through 
evaluation? 

Speculation on the usefulness of a test equating study in Follow Through evaluation must 
begin with the presumption that standardized achievement test results will, once again, 
constitute a primary indicator of program effects. It should be clear that this 
presumption does not constitute a recommendation. 

If standardized achievement tests are not used to assess program impact, they might 
still provide a useful indicator of the characteristics of recipients of program 
services, or of the population of potential recipients. But this limited use of 
standardized achievement tests in Follow Through evaluation probably would not warrant 
the same attention to equivalence of measures as would use of such tests for assessment 
of impact. 

If a number of tests of basic skills suitable for use with children in kindergarten 
through grade three were to be equated successfully, the obvious advantage would be 
greater flexibility in the selection of tests for evaluation of basic skills programs at 
those grade levels. Potential benefits include savings in time and money, and increased 
measurement validity. 

As noted above, two of the five generic program evaluation questions identified by 
Boruch and Cordray (1980), "Who is served by the program?" and "Who needs services?", 
could be answered in part through standardized achievement test results. Just as it is 
common to describe recipients and potential recipients of program services in terms of 
age distribution, racial and ethnic background, socioeconomic level, sex, urbanism of 
school, and grade level, achievement status in the basic skills areas is another common 
descriptor. To secure such information, a standardized achievement test battery is 
typically administered to all program participants and, perhaps, to all students in 
selected grades in school systems that participate in a program. Often these same school 
systems administer one of a small number of standardized achievement test batteries as a 
part of their routine testing programs. As a result, students in the target grades of the 
compensatory program are subjected to two testing sessions, with consequent loss of 
instructional time and demonstrably reduced motivation to perform well on the tests. 



Were subtests in reading and mathematics from a number of widely used achievement 
test batteries to be equated successfully at levels suitable for use in kindergarten 
through grade three, future Follow Through evaluations might avoid special 
administrations of such tests for purposes of describing program participants, students 
in comparison groups, and potential program participants. Achievement test data already 
available in the archives of participating school systems could be translated to a common 
reference scale, and aggregated to form achievement distributions for all groups of 
interest. 

The other obvious application of standardized achievement test data in Follow 
Through evaluation is in response to Boruch and Cordray's question "What are the effects 
of services on recipients?" Again, it is possible that equating of basic skills subtests 
at levels appropriate for use in kindergarten through grade three would allow the use of 
achievement test data already available in school systems' files to answer this question, 
with an attendant savings in testing time and testing cost. Since most Follow Through 
eligible school systems probably receive funds through Title I, ESEA, it is not 
unreasonable to expect that they routinely test some, if not all, of their students in 
the early elementary grades at the beginning and end of each school year. 

Perhaps the greatest benefit of an equating of early elementary basic skills tests 
would be the possibility that Follow Through model sponsors could select one of a number 
of achievement tests for use in evaluating their models. The potential for increasing 
measurement validity is real and important, as evidenced by an increasing body of research 
on the effects of congruence between curriculum content, instructional content, and 
achievement tests. Although a variety of subtests carry the label "reading comprehension 
test" or "arithmetic concepts test", it has become increasingly clear in recent years 
that they do not all measure the same thing. Bianchini (1973) addresses this problem in 
a paper on the appropriateness of differentiated norms for the evaluation of programs for 
disadvantaged and minority students. He cites the results of an analysis by Calder (19?6) 
of the finding that 65 percent of first-grade students in California scored below the 
first quartile on the national norms of the Stanford Reading Test when that test was used 
in an evaluation of the Miller-Unruh Reading Program in 1966. The California state 
legislature had budgeted the compensatory reading program on the basis of the reasonable 
expectation that about one-fourth of California's first-graders would score below the 
first quartile of the national norm distribution, and the disproportionate finding caused 
considerable reaction. In the midst of a variety of hypotheses on the reasons for the 
poor showing by California's first-graders (ranging from an analysis of the IQ 
distribution and racial composition of the sample used in the Stanford norms to 
speculation about the excessive difficulty of Stanford Reading Test items), Calder 
conducted an analysis of the congruence of the vocabulary assessed by the test and the 
vocabulary used in the state-provided instructional materials for first-graders. He found 
that the overlap was only [figure illegible] percent. On the basis of this finding alone, 
one cannot claim a causal relationship. However, Bianchini completed a subsequent analysis 
of the vocabulary of first-grade readers used in California in 1971 and the vocabulary 
assessed by the Reading Test of the Cooperative Primary Tests adopted by the state in that 
year for evaluation of the Miller-Unruh Reading Program. He found an overlap of 55 percent. 
Correspondingly, the performance of California's first-graders essentially matched that of 
the national norms sample on the Cooperative Primary Tests. The medians matched 
perfectly, whereas the California median was at the thirty-eighth percentile of the 
Stanford Reading Test norms in 1966. Bianchini concludes: 

"The point is that in any program at the early grades it is particularly 
important for all children that the test content be related to instructional 
content. The reason for this is that children within the early grades learn 
only within the bounds of the curriculum they experience." 
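
A vocabulary-overlap figure of the kind reported above can be computed as in the following minimal sketch. The source does not state how overlap was defined, so the choice here (the share of distinct test words that also appear in the instructional materials) is an assumption, and the word lists are invented for illustration.

```python
def vocabulary_overlap_percent(test_words, curriculum_words):
    """Percent of distinct test vocabulary that also appears in the curriculum.

    Comparison is case-insensitive. The denominator is the test vocabulary,
    which is an assumed definition of "overlap" for this illustration.
    """
    test_vocab = {w.lower() for w in test_words}
    taught = {w.lower() for w in curriculum_words}
    if not test_vocab:
        raise ValueError("test vocabulary is empty")
    return 100.0 * len(test_vocab & taught) / len(test_vocab)


# Hypothetical word lists.
test_items = ["cat", "dog", "sun", "hat", "boat", "farm"]
state_readers = ["Cat", "sun", "tree", "farm", "ball"]
print(round(vocabulary_overlap_percent(test_items, state_readers), 1))
```

Using the curriculum vocabulary as the denominator instead would answer a different question, namely how much of what was taught the test happens to sample.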

Further evidence on the need to consider the congruence between curriculum content 
and test content in program evaluations was provided by Hoepfner (1973). After developing 
a taxonomy of content categories for reading tests and mathematics tests, Hoepfner 
categorized all items in the reading and mathematics subtests of the eight standardized 
achievement test series that were most widely used in [year illegible]. He found that the 
subtests differed widely in their content emphases, despite their common titles. For 
example, at the levels recommended by their publishers for use with first-grade students, 
the percentage of items assessing mastery of word attack skills varied from a low of zero to 
a high of 60 percent in the reading subtests Hoepfner reviewed. In assessing recognition 
of word meanings (termed "vocabulary" in some tests), the percentage of items at the 
first-grade level varied from zero to 5? percent. And assessment of reading comprehension 
at the first-grade level was the function of 14 percent of the items in one subtest, of 
53 percent of the items in another, and of widely varying percentages between these 
extremes in the remaining six. 

Hoepfner found similar content differences in his review of mathematics subtests. 
For example, knowledge of numbers and sets was assessed by only two percent of the items 
contained in the second-grade level of one test and by 24 percent of the items contained 
in another test intended for second-graders. Whole-number computation was the objective 
of 60 percent of the items in one test intended for second-graders, but was not assessed 
at all by the items in the second-grade mathematics test of another battery. 

In an analysis that was similar to Hoepfner's but involved multiple judges in the 
classification of items, Porter, Schmidt, Floden and Freeman (1978) found that the 
distributions of items in the mathematics subtests of standardized achievement batteries 
intended for use with fourth-graders differed substantially across objectives and 
content. For example, they found that operations with single-digit numbers were 
represented in only two percent of the items in one battery, but in 20 percent of the 
items in another. Problems involving whole numbers constituted 39 percent of the items in 
one battery, but made up 66 percent of the items in another. Addition varied from 12 
percent of the items in one battery to 21 percent of the items in another. Graphs, 
figures, and tables were used as stimuli in 43 percent of the items in one battery, but 
were present in only 15 percent of the items in another. Clearly, these tests differed 
substantially in their mode of presentation of arithmetic material, the arithmetic 
operations they required students to perform, and in the nature of the material they 
presented. Although their total-score intercorrelations would probably be in the high 
eighties, the tests would not provide equally valid representations of the effectiveness 
of a given basic-skills mathematics program. 

Porter et al. conclude as follows (p. 538): 

"Treating practical significance in instructional program evaluation 
requires intimate familiarity with the measures on which effects are estimated 
and their substantive relationship with the goals of the program being 
evaluated. Past attempts to provide general solutions to the size of effect 
problem have relied on standardized indices which can be estimated and reported 
without any knowledge of what was measured. For this reason these efforts are 
viewed as steps in the wrong direction. Instead, what is called for is a 
procedure whereby the substantive goals of the program, the instructional 
outcomes implied by a test, and the interrelationship between the two are made 
explicit. The procedure should facilitate investigation of treatment-by-item 
interactions and at the same time facilitate a description of the measures in 
sufficient detail to support inferences regarding practical significance." 

As has been discussed earlier, previous national evaluations of the Follow Through 
program have fallen prey to the error that Porter and his colleagues identify. If a 
number of widely used standardized achievement batteries suitable for use in kindergarten 
through grade three could be equated, program sponsors could then select reading 
subtests and mathematics subtests from any of the equated batteries for use in evaluating 
their Follow Through models. Even with the diversity of content emphases noted above, it 
is not certain that the curriculum congruence of any of the subtests would be adequate to 
the evaluation of all (or even most) Follow Through models. However, it is far more likely 
that suitable content matches between tests and curricula could be realized if seven or 
eight test batteries were available than if only one battery were used for evaluation of 
the entire program, as was the case in the SRI-Abt evaluation. This advantage alone might 
justify NIE's investment in another large-scale test equating study. 

How could the results of a large-scale test equating study be used in Follow Through 
Research? 

The program of inquiry on early primary education for children from low income 
families envisioned by Schiller et al. (1980) includes the desire to develop: 

"New uses for information systems, including testing and evaluation results, to 
bring better diagnostic and prescriptive information to bear on Follow Through student 
learning needs." (p. ii) 

In a variety of other sections, the planning document recognizes the need to develop 
new strategies and procedures for assessing the consequences of Follow Through 
interventions. 

A large-scale test equating study has the potential of contributing to a better 
understanding of the outcomes and effects of early primary education programs in a number 
of ways. Some of the research outcomes might provide tools that could be applied directly 
to the future evaluation of Follow Through and other early primary intervention programs, 
while other research products would be more fundamental and less immediately applicable. 
A good bit of the research would focus on test equating methodology itself, while other 
foci would involve extensions of the research that has emanated from the Anchor Test 
Study. 

It is not clear that widely used standardized achievement tests appropriate for 
students in kindergarten through grade three are similar enough in their psychometric 
characteristics to allow successful equating. An initial benefit of a test equating study 
at these grade levels would be an examination of the feasibility of equating various 
early primary reading tests and various early primary arithmetic tests. Such a 
feasibility analysis would extend current knowledge on the degree of parallelism required 
to sustain consistent equating relationships between non-parallel tests. 

The literature on test equating (Lord, 1950; Angoff, 1971; Jaeger, 1981) suggests
that, although non-parallel tests can be calibrated, they cannot be equated. The
distinction is in the consistency of the scaling relationship between the two tests
across various populations of examinees. Strictly parallel tests are virtually
interchangeable, in that they measure the same psychometric function with the same degree
of reliability for all groups of examinees. For strictly parallel tests, then, an
equating relationship is unique. Once established for any population of examinees, it
holds for all populations. In contrast, non-parallel tests differ either in the
psychometric function measured, in reliability, or in both characteristics. Although it
is possible to establish a functional correspondence between the scales of non-parallel
tests (e.g., by defining as equivalent, raw scores corresponding to the same standard
score or the same percentile rank) for a given population, the relationship will not be
consistent for another population. In theory, then, the original Anchor Test Study should
have been infeasible.
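
The two ways of defining equivalent raw scores mentioned above (same standard score,
same percentile rank) can be illustrated with a minimal Python sketch. It is not the
procedure used in the Anchor Test Study; the function names and the five-examinee score
lists are hypothetical and serve only to show the arithmetic.

```python
import statistics


def linear_equate(x, scores_x, scores_y):
    """Map raw score x on test X to the test Y score with the same standard score."""
    mean_x, sd_x = statistics.mean(scores_x), statistics.pstdev(scores_x)
    mean_y, sd_y = statistics.mean(scores_y), statistics.pstdev(scores_y)
    return mean_y + sd_y * (x - mean_x) / sd_x


def percentile_rank(x, scores):
    """Percent of scores below x plus half the percent of scores equal to x."""
    below = sum(s < x for s in scores)
    equal = sum(s == x for s in scores)
    return 100.0 * (below + 0.5 * equal) / len(scores)


def equipercentile_equate(x, scores_x, scores_y):
    """Return the observed test Y score whose percentile rank is closest to that of x."""
    target = percentile_rank(x, scores_x)
    candidates = sorted(set(scores_y))
    return min(candidates, key=lambda y: abs(percentile_rank(y, scores_y) - target))


# Hypothetical raw scores for one population of examinees (illustration only).
scores_x = [10, 12, 14, 16, 18]
scores_y = [20, 25, 30, 35, 40]

print(linear_equate(16, scores_x, scores_y))
print(equipercentile_equate(16, scores_x, scores_y))
```

Both functions depend entirely on the score distributions supplied to them. For
non-parallel tests, data from a different population would generally yield a different
correspondence, which is the inconsistency described above.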

In practice, the alternate forms of standardized achievement tests produced by
virtually all test publishers are not strictly parallel. Although they constitute the
best approximations to parallel forms presently available, they differ to some degree in
overall difficulty, raw-score variability, and internal consistency reliability. The
intercorrelations of alternate forms are extremely high, but are often slightly lower
than their internal consistency reliabilities would allow. Thus strict parallelism is a
theoretical ideal that is approached, but never realized in practice. Yet alternate forms
of standardized achievement tests are routinely equated, and the equating relationships
established appear to be acceptably consistent.

If properly designed, a large-scale test equating study at the early primary
grades would support an analysis of the content similarity and correlational requirements
that non-parallel tests must meet to achieve sufficiently consistent equating
relationships. The relationships would need to be consistent enough over populations that
differ in socioeconomic composition, IQ distribution, racial composition, and other
demographic descriptors that typically distinguish low-income students from the majority
of students, that they could be used in large-scale evaluation studies or for individual
assessments. Bianchini and Vale (1975) have completed an initial exploration of the
parallelism requirements of successful equating using data from the Anchor Test Study. But
their findings are limited in two ways. First, they apply only to reading comprehension
and vocabulary subtests appropriate for use in grades four through six. Second, the
sampling procedures used in the Anchor Test Study sought proportional representation of
students in various minority ethnic and racial groups, and therefore sampled such students
in far smaller numbers than majority white students. Differences in equating relationships
found for white students and black students could as readily be attributed to random
fluctuations as to consistent bias errors. Nonetheless, Bianchini and Vale concluded that
the equating relationships developed in the Anchor Test Study were reasonably consistent
across racial groups, and recommended that the equating tables be used with black
students and Spanish-surnamed students, as well as with white students.

The results of an additional equating study could be used to test the limits of
generalizability of this finding across grade levels and across subject areas. In
addition, a new test equating study could be designed so as to sample racial and ethnic
minorities, and students in other groups of interest, in sufficient numbers to allow
clear differentiation between random fluctuations in equating relationships and
truly stable inconsistencies.

One other important consequence of the intercorrelation between tests to be equated
is the stability (degree of random fluctuation across samples from the same population)
of equating relationships. An explicit mathematical relationship between inter-test
correlation and the standard error of equating has been developed for classical linear
equating (Lord, 1950). However, similar relationships are not available for the form of
equipercentile equating found to be most stable in the Anchor Test Study, nor for
equating procedures that employ item response theory models. An additional test equating
study would have the potential of greatly extending the empirical basis for examining the
relationship between test characteristics and the statistical stability of equating
functions.
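
Where no analytic standard error is available, the stability of an equating function can
be examined empirically. The Python sketch below does this by bootstrap resampling, a
substitute technique and not the analytic result cited above. It assumes a single-group
design in which each simulated examinee takes both tests; the function names, sample
size, and correlation values are hypothetical.

```python
import random
import statistics


def linear_equate(x, scores_x, scores_y):
    """Map raw score x on test X to the test Y score with the same standard score."""
    mean_x, sd_x = statistics.mean(scores_x), statistics.pstdev(scores_x)
    mean_y, sd_y = statistics.mean(scores_y), statistics.pstdev(scores_y)
    return mean_y + sd_y * (x - mean_x) / sd_x


def simulate_pairs(n, rho, seed):
    """Simulate n examinees with unit-variance scores on two tests correlated about rho."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        ability = rng.gauss(0.0, 1.0)
        x = rho ** 0.5 * ability + (1.0 - rho) ** 0.5 * rng.gauss(0.0, 1.0)
        y = rho ** 0.5 * ability + (1.0 - rho) ** 0.5 * rng.gauss(0.0, 1.0)
        pairs.append((x, y))
    return pairs


def bootstrap_se(x, pairs, n_boot=300, seed=1):
    """Standard deviation of the equated score at x over bootstrap resamples of examinees."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        resample = [rng.choice(pairs) for _ in pairs]
        xs = [p[0] for p in resample]
        ys = [p[1] for p in resample]
        estimates.append(linear_equate(x, xs, ys))
    return statistics.stdev(estimates)


# Compare stability at the same score point for highly and moderately correlated tests.
se_high_corr = bootstrap_se(1.0, simulate_pairs(300, rho=0.95, seed=42))
se_low_corr = bootstrap_se(1.0, simulate_pairs(300, rho=0.50, seed=42))
print(f"SE with inter-test correlation near 0.95: {se_high_corr:.3f}")
print(f"SE with inter-test correlation near 0.50: {se_low_corr:.3f}")
```

The same resampling loop could in principle wrap an equipercentile or item response theory
procedure in place of `linear_equate`, which is one way the empirical basis mentioned
above might be extended.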

As was the case for tests equated in the Anchor Test Study, publishers of the most
widely used standardized achievement tests recommended for use in the early primary
grades suggest various combinations of levels of their tests for students in the
gradespan of interest (Hoepfner, 1973). For example, the same level of the California
Achievement Tests is recommended for use in grades one and two, and a different level is
recommended for use in grade three. However, three different levels of the Iowa Tests of
Basic Skills are recommended for use in grades one, two and three. If publishers'
recommendations are followed in the administration of various test levels in the early
primary grades, an equating study at those grade levels would provide the data necessary
to conduct research on the consistency and feasibility of vertical test equating across
levels of standardized tests. As noted above, Slinde and Linn (1977, 1978) have examined
vertical equating relationships for reading comprehension tests at grades four through
six. The methodology they have developed for this type of research could be applied
directly to tests suitable for use in another gradespan and to tests in another subject
area.

The relative utility of classical equating models vs. methods that employ various
item response theory models is subject to debate, despite extensive research on the topic
(Beard and Pettie, 1979; Marco, Petersen and Stewart, 1979). Although Beard and Pettie
concluded that "Rasch equating was comparable to linear equating in the analysis of
longitudinal trends in basic skills achievement," Marco, et al. found substantial
differences between the methods in terms of bias error and random error, depending on
the comparability of the tests being equated and the comparability of the samples of
examinees used to gather data for equating. Beard and Pettie employed items from the
Florida Assessment tests in communications and mathematics, appropriate for third-graders
and fifth-graders, whereas Marco, et al. based their analyses on Scholastic
Aptitude Test items and samples of high school students. Differences in their conclusions
are likely attributable, at least in part, to their use of items from different tests and
examinees in different gradespans. It appears that various equating methods will produce
similar results under some circumstances and substantially different results under
others. The number of variables involved in the relationships among various equating
methods is such that purely analytic rules of correspondence are not likely to be
developed soon. An intriguing question that is yet to be resolved, as noted above, is the
choice of an appropriate standard for judging the correctness of an equating
relationship. That is, if two equating methods produce substantially different results,
which one should be regarded as correct? Although this logical definitional problem is
unlikely to be resolved through another large-scale equating study, the circumstances in
which various equating methods yield comparable results or divergent results could be
further explored using data from an equating study involving basic skills tests with
early primary students. The methodology that Marco, et al. have applied to the Scholastic
Aptitude Test data base could be employed directly with early primary equating results
and, in parallel fashion, with results from the original Anchor Test Study. The major
outcomes of this research would be greater understanding of the conditions needed to
equate two tests successfully; conditions under which one equating method might produce
stable and consistent equating functions, whereas another might not; and conditions under
which various equating methods, including classical and item response theory methods,
produce virtually identical results.

Beyond its potential value in fostering additional research on test equating, another
major test equating study could provide a data base that would facilitate extension of
the more general research that has emerged from the original Anchor Test Study. In
particular, the research cited above that concerns the correlates of academic
achievement, an area of major research emphasis in the 1981 NIE Research Grants
Announcement on Testing and Evaluation, could be extended to the early primary grades.
Although some interesting findings on the correlates of achievement emerged from the
Anchor Test Study, they were limited by the restricted range of ancillary data collected
in that study. Since there was no intention of using Anchor Test Study results in an
investigation of the correlates of achievement at the time the study was designed, the
only ancillary data collected were those needed to verify the representativeness of
samples used or to examine the comparability of equating relationships across sex, race,
and IQ groups. A new equating study could be carefully designed to support a far greater
range of investigations, including studies of the correlates of achievement, and would
therefore be of greater value than the Anchor Test Study in terms of secondary analyses.

As noted above, another area of research that has made use of Anchor Test Study
results is analysis of test item bias. Again, a test equating study at the early primary
grades could be designed so as to facilitate this objective. Careful attention to the
adequacy of samples of students of various minority groups and IQ groups, and students of
both sexes, would be required, as is the case in many of the research areas already
discussed. In addition, data tapes would have to be designed to allow recovery of the
most basic information on students' responses to test items, as was done in the Anchor
Test Study.

In summary, a new test-equating study at the early elementary grades would be
consistent with several Follow Through objectives. First, it would have the potential of
contributing to the validity of an evaluation of Follow Through effects (assuming that
the past and present policy of using standardized achievement test results as a primary
indicator of Follow Through success is continued). Second, such a study would contribute
to research on methods of assessing programs for children of low-income families at the
early primary grades through fundamental research on test equating methods and through
research in a variety of ancillary areas.

RECOMMENDATIONS 

Although the potential benefits of a test equating study at the early primary
grades have been discussed in some detail, it is not recommended that such a study be
initiated as a part of Follow Through research without additional planning and
investigation. In particular, use of standardized achievement tests in the early primary
grades may well have diminished considerably since Hoepfner's study was completed (in
the year of the USOE-sponsored conference at which he presented the paper published in
the 1973 reference). If so, the value of an equating study at the early primary grade
levels would be reduced, apart from its utility in large-scale program evaluations that
incorporated standardized achievement tests. Further, patterns of test usage may have
changed since Hoepfner identified the eight test batteries used in his study as those
most widely used. Reliable information on current test usage would be needed prior to
selection of test batteries for an equating study, and to provide a basis for deciding
whether or not a test-equating study at the early primary grade levels was warranted.

If a review of the use of standardized achievement tests in the early primary grades
suggested that relatively few tests were widely used in the basic skills areas, a study
of the feasibility of equating corresponding subtests of those test batteries would be a
logical step. Administration of pairs of subtests to samples of examinees large enough to
estimate inter-test correlations and internal consistency reliabilities would provide the
information needed to make a reasoned decision on whether to conduct a large-scale
equating study.